Amyotrophic lateral sclerosis (ALS)

About Dataset

The voice database was collected at the Republican Research and Clinical Center of Neurology and Neurosurgery (Minsk, Belarus). It consists of 128 sustained vowel phonations (64 of vowel /a/ and 64 of vowel /i/) from 64 speakers, 31 of whom were diagnosed with ALS. Each speaker was asked to produce a sustained phonation of the vowels /a/ and /i/ at a comfortable pitch and loudness, as steadily and for as long as possible. The database is therefore nearly balanced, containing 48% pathological and 52% healthy voices.

The age of the 17 male patients ranges from 40 to 69 (mean 61.1 ± 7.7) and the age of the 14 female patients ranges from 39 to 70 (mean 57.3 ± 7.8). For the healthy controls (HC), the age of the 13 men ranges from 34 to 80 (mean 50.2 ± 13.8) and the age of the 20 women ranges from 37 to 68 (mean 56.1 ± 9.7). The samples were recorded at 44.1 kHz using different smartphones with regular headsets and stored as 16-bit uncompressed PCM files. The average record duration was 3.7 ± 1.5 s in the HC group and 4.1 ± 2.0 s in the ALS group. Detailed information about the ALS patients is presented in the source article.

data source link: https://www.kaggle.com/datasets/daniilkrasnoproshin/amyotrophic-lateral-sclerosis-als?select=Minsk2020_ALS_dataset.csv

In [4]:
# import the main libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore')
In [5]:
# import scikit-learn models, preprocessing, and evaluation utilities
from sklearn import tree
from sklearn.svm import SVC
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import  Pipeline
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold,StratifiedKFold,LeaveOneOut,ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,BaggingClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.metrics import accuracy_score,classification_report
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_selection import SelectKBest,f_classif
In [6]:
# import the data set
df = pd.read_csv('Minsk2020_ALS_dataset.csv')
In [7]:
# first five rows of the data
df.head()
Out[7]:
ID Sex Age J1_a J3_a J5_a J55_a S1_a S3_a S5_a ... dCCi(7) dCCi(8) dCCi(9) dCCi(10) dCCi(11) dCCi(12) d_1 F2_i F2_{conv} Diagnosis (ALS)
0 8 M 58 0.321817 0.141230 0.199128 0.923634 6.044559 3.196477 3.770575 ... -0.024467 -0.005300 0.051874 -0.037710 -0.026549 -0.021149 4.825476 2526.285657 833.498083 1
1 20 F 57 0.344026 0.177032 0.206458 0.827714 1.967728 0.856639 1.179851 ... 0.002485 -0.004535 -0.000225 -0.006977 -0.012510 0.014773 5.729322 1985.712014 561.802625 1
2 21 F 58 0.264740 0.148228 0.177078 0.532566 1.850893 0.942743 1.071950 ... -0.013927 0.007908 0.007960 -0.009022 -0.012488 -0.015588 8.258488 2364.695972 796.723440 1
3 22 F 70 0.455793 0.174870 0.243660 0.962641 2.883768 1.284926 1.915058 ... -0.019285 -0.021768 0.020495 0.035976 -0.034648 0.008021 5.447137 1860.172768 359.409974 1
4 24 M 66 0.269335 0.143961 0.167465 0.547745 2.327924 1.164109 1.420891 ... -0.005743 0.004726 -0.015247 0.003900 -0.007686 -0.003784 8.562517 2051.627447 817.111847 1

5 rows × 135 columns

In [8]:
# last five rows of the data
df.tail()
Out[8]:
ID Sex Age J1_a J3_a J5_a J55_a S1_a S3_a S5_a ... dCCi(7) dCCi(8) dCCi(9) dCCi(10) dCCi(11) dCCi(12) d_1 F2_i F2_{conv} Diagnosis (ALS)
59 123 M 43 0.255799 0.123679 0.182658 0.505591 6.222031 2.876602 3.894294 ... 0.220533 0.089766 -0.120838 -0.004221 -0.013165 0.004642 9.855665 3128.341308 1990.937097 0
60 125 M 63 0.513175 0.296489 0.334845 0.729804 9.686563 4.327943 5.687977 ... 0.028016 -0.038739 0.011588 -0.011281 -0.004294 0.011239 11.094558 1964.218942 601.076046 0
61 127 F 67 0.383901 0.245923 0.251359 0.415136 4.148414 2.069757 2.527213 ... 0.011685 0.007883 -0.014839 0.013859 0.011145 0.001418 12.564742 2526.285657 934.343638 0
62 129 F 68 1.336216 0.815757 0.733197 0.981928 11.224542 5.295879 6.994751 ... 0.015712 0.013437 0.025113 0.008852 -0.010132 -0.008458 10.670669 3201.250289 2284.051658 0
63 131 F 60 0.916706 0.566121 0.512857 1.467165 6.372832 3.251168 3.539229 ... -0.046235 0.041946 -0.065313 -0.016682 0.061026 -0.005883 6.972152 2792.655884 1518.529172 0

5 rows × 135 columns

EDA

  • get information about the data types
  • if any feature is categorical, convert it into binary form
  • look at the age distribution of the data and summarize it
  • get the descriptive statistics of the data
  • construct a histogram and a boxplot for every column using a for loop
  • get the correlation heatmap and find whether certain columns are associated
  • choose the columns with notable associations and draw their pairplot
In [10]:
# data types of every column (using a for loop)
for col in df.columns:
    print(f"'{col}'  data type: {df[col].dtypes}")
'ID'  data type: int64
'Sex'  data type: object
'Age'  data type: int64
'J1_a'  data type: float64
'J3_a'  data type: float64
'J5_a'  data type: float64
'J55_a'  data type: float64
'S1_a'  data type: float64
'S3_a'  data type: float64
'S5_a'  data type: float64
'S11_a'  data type: float64
'S55_a'  data type: float64
'DPF_a'  data type: float64
'PFR_a'  data type: float64
'PPE_a'  data type: float64
'PVI_a'  data type: float64
'HNR_a'  data type: float64
'GNEa_{\mu}'  data type: float64
'GNEa_{\sigma}'  data type: float64
'Ha(1)_{mu}'  data type: float64
'Ha(2)_{mu}'  data type: float64
'Ha(3)_{mu}'  data type: float64
'Ha(4)_{mu}'  data type: float64
'Ha(5)_{mu}'  data type: float64
'Ha(6)_{mu}'  data type: float64
'Ha(7)_{mu}'  data type: float64
'Ha(8)_{mu}'  data type: float64
'Ha(1)_{sd}'  data type: float64
'Ha(2)_{sd}'  data type: float64
'Ha(3)_{sd}'  data type: float64
'Ha(4)_{sd}'  data type: float64
'Ha(5)_{sd}'  data type: float64
'Ha(6)_{sd}'  data type: float64
'Ha(7)_{sd}'  data type: float64
'Ha(8)_{sd}'  data type: float64
'Ha(1)_{rel}'  data type: float64
'Ha(2)_{rel}'  data type: float64
'Ha(3)_{rel}'  data type: float64
'Ha(4)_{rel}'  data type: float64
'Ha(5)_{rel}'  data type: float64
'Ha(6)_{rel}'  data type: float64
'Ha(7)_{rel}'  data type: float64
'Ha(8)_{rel}'  data type: float64
'CCa(1)'  data type: float64
'CCa(2)'  data type: float64
'CCa(3)'  data type: float64
'CCa(4)'  data type: float64
'CCa(5)'  data type: float64
'CCa(6)'  data type: float64
'CCa(7)'  data type: float64
'CCa(8)'  data type: float64
'CCa(9)'  data type: float64
'CCa(10)'  data type: float64
'CCa(11)'  data type: float64
'CCa(12)'  data type: float64
'dCCa(1)'  data type: float64
'dCCa(2)'  data type: float64
'dCCa(3)'  data type: float64
'dCCa(4)'  data type: float64
'dCCa(5)'  data type: float64
'dCCa(6)'  data type: float64
'dCCa(7)'  data type: float64
'dCCa(8)'  data type: float64
'dCCa(9)'  data type: float64
'dCCa(10)'  data type: float64
'dCCa(11)'  data type: float64
'dCCa(12)'  data type: float64
'J1_i'  data type: float64
'J3_i'  data type: float64
'J5_i'  data type: float64
'J55_i'  data type: float64
'S1_i'  data type: float64
'S3_i'  data type: float64
'S5_i'  data type: float64
'S11_i'  data type: float64
'S55_i'  data type: float64
'DPF_i'  data type: float64
'PFR_i'  data type: float64
'PPE_i'  data type: float64
'PVI_i'  data type: float64
'HNR_i'  data type: float64
'GNEi_{\mu}'  data type: float64
'GNEi_{\sigma}'  data type: float64
'Hi(1)_{mu}'  data type: float64
'Hi(2)_{mu}'  data type: float64
'Hi(3)_{mu}'  data type: float64
'Hi(4)_{mu}'  data type: float64
'Hi(5)_{mu}'  data type: float64
'Hi(6)_{mu}'  data type: float64
'Hi(7)_{mu}'  data type: float64
'Hi(8)_{mu}'  data type: float64
'Hi(1)_{sd}'  data type: float64
'Hi(2)_{sd}'  data type: float64
'Hi(3)_{sd}'  data type: float64
'Hi(4)_{sd}'  data type: float64
'Hi(5)_{sd}'  data type: float64
'Hi(6)_{sd}'  data type: float64
'Hi(7)_{sd}'  data type: float64
'Hi(8)_{sd}'  data type: float64
'Hi(1)_{rel}'  data type: float64
'Hi(2)_{rel}'  data type: float64
'Hi(3)_{rel}'  data type: float64
'Hi(4)_{rel}'  data type: float64
'Hi(5)_{rel}'  data type: float64
'Hi(6)_{rel}'  data type: float64
'Hi(7)_{rel}'  data type: float64
'Hi(8)_{rel}'  data type: float64
'CCi(1)'  data type: float64
'CCi(2)'  data type: float64
'CCi(3)'  data type: float64
'CCi(4)'  data type: float64
'CCi(5)'  data type: float64
'CCi(6)'  data type: float64
'CCi(7)'  data type: float64
'CCi(8)'  data type: float64
'CCi(9)'  data type: float64
'CCi(10)'  data type: float64
'CCi(11)'  data type: float64
'CCi(12)'  data type: float64
'dCCi(1)'  data type: float64
'dCCi(2)'  data type: float64
'dCCi(3)'  data type: float64
'dCCi(4)'  data type: float64
'dCCi(5)'  data type: float64
'dCCi(6)'  data type: float64
'dCCi(7)'  data type: float64
'dCCi(8)'  data type: float64
'dCCi(9)'  data type: float64
'dCCi(10)'  data type: float64
'dCCi(11)'  data type: float64
'dCCi(12)'  data type: float64
'd_1'  data type: float64
'F2_i'  data type: float64
'F2_{conv}'  data type: float64
'Diagnosis (ALS)'  data type: int64
In [11]:
# drop the ID column: it is an identifier, not a feature
del df['ID']
In [12]:
# find the data shape (number of rows and columns)
df.shape
Out[12]:
(64, 134)
In [13]:
# descriptive statistics: count, mean, std, min, quartiles (25%, 50%, 75%), max
df.describe().T
Out[13]:
count mean std min 25% 50% 75% max
Age 64.0 56.390625 10.203668 34.000000 50.750000 58.000000 63.250000 80.000000
J1_a 64.0 0.658951 0.724002 0.098881 0.325932 0.458935 0.772783 5.391649
J3_a 64.0 0.379242 0.435636 0.065791 0.172422 0.253976 0.465699 3.217293
J5_a 64.0 0.395886 0.431926 0.092655 0.198274 0.293405 0.476541 3.321567
J55_a 64.0 0.945496 0.791558 0.285497 0.538387 0.698183 1.189025 5.991336
... ... ... ... ... ... ... ... ...
dCCi(12) 64.0 0.001269 0.020800 -0.083977 -0.006534 0.000661 0.009515 0.077897
d_1 64.0 9.164473 2.681449 2.276702 7.604734 9.646564 10.757522 15.420777
F2_i 64.0 2495.116475 617.755856 444.730268 2051.627447 2471.097222 2938.236560 3599.554394
F2_{conv} 64.0 1209.976405 553.694046 48.246203 800.181156 1206.596083 1551.677678 2441.219054
Diagnosis (ALS) 64.0 0.484375 0.503706 0.000000 0.000000 0.000000 1.000000 1.000000

133 rows × 8 columns

In [14]:
# encode Sex as binary: 'M' -> 0, 'F' -> 1 (one-hot encoding is an alternative)

df['Sex'] = df['Sex'].replace({'M': 0, 'F': 1})
In [15]:
# list the columns that contain null values
[features for features in df.columns if df[features].isnull().sum()>0]
Out[15]:
[]
In [17]:
# Define the grouper function
def grouper(df, idx, col):
    age = df[col].loc[idx]
    if 20 <= age < 40:
        return '20-40'
    elif 40 <= age < 60:
        return '40-60'
    elif 60 <= age < 80:
        return '60-80'
    else:
        return 'Other'

This code defines a custom function called grouper that takes three arguments: df (a DataFrame), idx (an index label), and col (a column name).

The function looks up the value of column col at index idx, determines which age range it falls into, and returns a string naming that range.

The age ranges are defined as:

  • '20-40' if the age is at least 20 and under 40
  • '40-60' if the age is at least 40 and under 60
  • '60-80' if the age is at least 60 and under 80
  • 'Other' for all other ages

This custom function can then be used with the groupby method to group data by these age ranges.
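The same binning can also be done with pandas' built-in pd.cut; a minimal sketch with invented toy ages (not the Minsk data), matching the half-open boundaries of grouper:

```python
import pandas as pd

ages = pd.Series([35, 45, 65, 58])

# right=False makes each bin half-open: [20, 40), [40, 60), [60, 80),
# matching the `<=` / `<` checks inside grouper
groups = pd.cut(ages, bins=[20, 40, 60, 80], right=False,
                labels=['20-40', '40-60', '60-80'])
print(list(groups))  # → ['20-40', '40-60', '60-80', '40-60']
```

pd.cut returns a categorical Series, so it plugs straight into groupby or value_counts.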

In [19]:
# group by age range and count, for males (Sex == 0 after the earlier encoding)
Male = df.Age[df.Sex == 0].groupby(lambda x: grouper(df, x, 'Age')).count()

# group by age range and count, for females (Sex == 1)
Female = df.Age[df.Sex == 1].groupby(lambda x: grouper(df, x, 'Age')).count()
  1. df.Sex == 0 keeps only the rows where 'Sex' is 0 (male, after the earlier encoding).
  2. df.Age[df.Sex == 0] applies this filter to the 'Age' column.
  3. groupby(lambda x: grouper(df, x, 'Age')) groups the filtered Series with the custom grouper function, which maps each index label to an age range.
  4. .count() counts the number of rows in each group.
  5. The result is a Series indexed by age range, whose values are the counts per group.
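The index-based groupby pattern can be seen in isolation with a toy Series (values invented for illustration):

```python
import pandas as pd

s = pd.Series([34, 45, 61, 52])

# the lambda receives each *index* label; we look the value up ourselves,
# exactly as grouper(df, idx, col) does
counts = s.groupby(lambda i: '40-60' if 40 <= s[i] < 60 else 'other').count()
print(counts['40-60'], counts['other'])  # → 2 2
```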
In [21]:
# Make sure both series have the same index
#all_categories = sorted(set(Male.index).union(set(Female.index)))
#Male = Male.reindex(all_categories, fill_value=0)
#Female = Female.reindex(all_categories, fill_value=0)
In [22]:
#Plotting
#xpos = np.arange(len(Female.index))

#plt.figure(figsize=(9, 5))
#plt.bar(xpos - 0.2, Male, width=0.2, label='Male')
#plt.bar(xpos, Female, width=0.2, label='Female')
#plt.xticks(xpos + 0.1, Female.index)
#plt.xlabel('Age Group')
#plt.ylabel('Count')
#plt.legend()
#plt.show()
In [23]:
#xpos = np.arange(len(Male.index))

#plt.figure(figsize=(9,5))
#plt.bar(xpos,Male,width=0.2,label = 'Male')
#plt.bar([x for x in 0.2+xpos],Female,width=0.2,label = 'Female')
#plt.xticks(xpos,Male.index)
#plt.legend()
#plt.show()
In [24]:
# assign the features and the target variable
x = df.drop('Diagnosis (ALS)', axis=1)
y = df['Diagnosis (ALS)']
In [25]:
# select the 15 best feature columns with a univariate F-test
f_clsif = SelectKBest(f_classif, k=15)
f_clsif.fit(x,y)
Out[25]:
SelectKBest(k=15)

Feature Selection using SelectKBest and f_classif

The code uses the SelectKBest class and f_classif function from sklearn.feature_selection to perform feature selection.

Steps:

  1. Create a SelectKBest instance with f_classif as the scoring function, keeping the top 15 features.
  2. Fit the SelectKBest object to the feature matrix x and target variable y.
  3. f_classif scores each feature by the ANOVA F-statistic of its association with the target.
  4. The fitted object f_clsif exposes the scores and the selected feature names for further analysis or modeling.
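A self-contained sketch of the same selection on synthetic data (random noise features plus one artificially class-dependent column) shows the mechanics:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))    # 64 samples, 10 noise features
y = rng.integers(0, 2, size=64)  # binary target
X[:, 3] += 3 * y                 # make feature 3 strongly class-dependent

selector = SelectKBest(f_classif, k=2).fit(X, y)
print(np.flatnonzero(selector.get_support()))  # feature 3 should be among the survivors
```

Because the shift on feature 3 is about three standard deviations, its F-statistic dwarfs the noise features and it survives the k=2 cut.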

In [27]:
f_clsif.scores_ # F-statistic of every feature (higher means more discriminative)
Out[27]:
array([1.51966036e+00, 4.81111518e+00, 3.72374464e+00, 2.80940566e+00,
       3.90272142e+00, 9.29424862e+00, 2.77970662e+00, 2.01078408e+00,
       3.04007079e+00, 4.53402558e+00, 2.83055323e+00, 1.78708514e-01,
       8.44003082e+00, 1.31234403e+01, 1.34691845e+01, 5.06556737e+00,
       7.26673419e+00, 2.37589532e+00, 1.41239019e-02, 6.77236381e-01,
       1.76272828e-01, 8.74198430e-01, 1.35243407e+00, 3.32816856e+00,
       7.10867752e+00, 7.82461970e+00, 2.33069569e+00, 2.94059917e+00,
       8.33835113e-01, 5.35734923e+00, 1.22983870e+00, 4.30339574e+00,
       7.71656069e+00, 4.65540447e+00, 5.47206862e+00, 1.12917647e+00,
       1.79560374e-01, 3.09671496e+00, 1.05684127e-01, 2.29953043e+00,
       1.05110077e+01, 6.42043481e+00, 1.97630163e+00, 1.87314320e+00,
       2.35278161e-01, 1.11248250e-01, 1.98683660e+00, 6.26361322e+00,
       1.21072262e+00, 3.34519979e+00, 3.32612120e-01, 2.59056141e+00,
       4.13056071e+00, 2.17872151e-02, 9.89016079e-03, 2.84651771e-02,
       1.31524937e+00, 2.49446319e+00, 3.05017065e+00, 6.99515361e-05,
       9.78964413e-01, 3.49720955e-01, 2.17440296e+00, 1.66873074e-01,
       6.16654676e-01, 1.15606486e+00, 3.62342157e-01, 6.45571084e-02,
       7.82227429e-01, 4.07988165e+00, 3.85109457e+00, 3.01162320e+00,
       3.97024944e+00, 6.62302552e+00, 4.34054776e+00, 1.95323489e+00,
       1.99930902e+00, 2.23995943e+00, 8.70338731e+00, 9.58352694e+00,
       1.77983577e+00, 5.37275723e+00, 1.11006484e+00, 2.57633485e+00,
       2.04992838e+00, 4.36453825e+00, 4.96125484e+00, 5.55612730e-01,
       7.11245184e-02, 1.86913341e-01, 2.21750071e-01, 5.26258017e+00,
       7.70989834e-01, 1.48838814e-01, 1.67050669e-01, 5.48693896e+00,
       4.33479004e+00, 7.53443314e+00, 8.50624257e+00, 4.15579605e+00,
       3.10833329e+00, 8.33534614e-02, 3.16402718e+00, 3.68208991e-01,
       5.33100220e-02, 2.17765765e-02, 1.31398139e-02, 1.54043674e+01,
       5.21783333e+00, 1.11496792e+00, 3.87499223e-01, 9.87683989e+00,
       4.63868919e+00, 4.99533015e+00, 1.65764791e+00, 1.78842543e-02,
       3.11618711e+00, 4.69388698e-03, 7.55314785e-01, 4.11418269e-02,
       2.40998460e+00, 4.71672103e+00, 3.72038084e+00, 7.11690869e+00,
       1.08582864e+00, 4.40878776e+00, 3.70158009e+00, 2.16007263e+00,
       3.28317149e-02, 3.66352847e+00, 1.63014436e+01, 6.24101762e+00,
       1.11532339e+01])
In [28]:
f_clsif.get_feature_names_out() # names of the 15 selected columns
Out[28]:
array(['J55_a', 'PFR_a', 'PPE_a', 'PVI_a', 'Ha(8)_{mu}', 'Ha(7)_{sd}',
       'Ha(7)_{rel}', 'PVI_i', 'HNR_i', 'Hi(8)_{sd}', 'Hi(1)_{rel}',
       'CCi(2)', 'CCi(6)', 'd_1', 'F2_{conv}'], dtype=object)
In [29]:
# build the feature matrix from the 15 selected columns, plus Age and Sex
x = df[['Age', 'Sex', 'J55_a', 'PFR_a', 'PPE_a', 'PVI_a', 'Ha(8)_{mu}', 'Ha(7)_{sd}',
        'Ha(7)_{rel}', 'PVI_i', 'HNR_i', 'Hi(8)_{sd}', 'Hi(1)_{rel}', 'CCi(2)', 'CCi(6)',
        'd_1', 'F2_{conv}']]
In [30]:
x.head(20)
Out[30]:
Age Sex J55_a PFR_a PPE_a PVI_a Ha(8)_{mu} Ha(7)_{sd} Ha(7)_{rel} PVI_i HNR_i Hi(8)_{sd} Hi(1)_{rel} CCi(2) CCi(6) d_1 F2_{conv}
0 58 0 0.923634 0.261201 0.953932 0.497905 -32.988570 3.718647 0.031975 0.548737 6.831568 7.702146 0.029578 5.290671 -7.162548 4.825476 833.498083
1 57 1 0.827714 0.214815 0.917383 0.496096 -38.601498 6.869505 0.024816 0.753773 19.656301 5.214603 0.084514 8.013684 6.707502 5.729322 561.802625
2 58 1 0.532566 0.250832 0.697039 0.484256 -30.253636 6.487032 0.042494 0.513454 26.247032 1.973133 0.190498 5.398531 -15.978241 8.258488 796.723440
3 70 1 0.962641 0.394053 1.400853 0.710780 -22.283258 2.293785 0.056093 0.531370 14.497427 2.452554 0.044136 -2.032011 7.383212 5.447137 359.409974
4 66 0 0.547745 0.280716 0.604214 0.464793 -27.578969 3.365784 0.048372 0.418242 24.602762 5.378356 0.174282 12.596246 -8.706192 8.562517 817.111847
5 51 0 0.505987 0.184189 0.630030 0.258841 -31.279294 2.724680 0.030564 0.283531 19.670345 4.352394 0.050627 14.725795 -3.615166 9.810520 1004.727725
6 57 0 0.765986 0.283794 1.390729 0.469168 -38.652722 6.161217 0.025839 0.471937 19.579729 5.490719 0.120715 15.453971 -4.737735 5.945219 1219.744513
7 58 0 1.205596 0.364785 2.034901 0.780017 -24.551761 5.529329 0.058102 0.912703 14.444541 6.353730 0.055619 15.284422 2.900285 8.422353 759.068477
8 67 0 1.951256 0.265532 1.450424 0.576613 -29.173665 3.654008 0.044167 0.556261 16.074488 3.338447 0.053069 4.219037 -9.566137 8.760510 669.022078
9 61 0 0.591160 0.120943 0.912168 0.330002 -30.162942 4.460441 0.029744 0.258268 19.190522 5.583344 0.071087 4.056457 1.257204 7.572111 838.978523
10 67 0 1.889785 0.366062 2.020179 1.696469 -40.625713 10.546459 0.021517 1.272719 8.934965 6.988977 0.103499 -0.569087 8.932630 2.276702 669.461749
11 67 0 1.304613 0.175029 1.802719 0.632997 -39.988576 9.099600 0.023266 0.665707 19.334026 10.915085 0.099554 16.218545 1.186972 10.674106 481.009629
12 50 1 0.454599 0.142044 0.355592 0.209852 -60.830922 3.480109 0.020255 0.223791 29.034071 5.511659 0.279002 12.274549 -19.301494 10.950821 1553.425003
13 63 1 1.813700 0.296273 2.143594 1.441035 -25.271617 2.894641 0.037276 1.393756 11.798070 8.683996 0.049739 8.556412 8.327992 7.029500 1288.920905
14 62 1 1.362995 0.204320 1.806991 0.859439 -12.270083 5.573836 0.045952 1.333404 16.072877 5.995308 0.195044 20.597221 -12.995120 12.862700 1552.852150
15 61 0 1.542224 0.366096 1.498665 0.705755 -32.126539 5.752471 0.025477 0.349088 22.958926 4.806596 0.367825 17.992741 0.557423 7.148809 784.563460
16 58 0 5.991336 2.059883 3.393774 3.152778 -54.128866 20.883980 0.014896 0.258555 13.320835 5.751912 0.044208 1.368346 6.137819 8.008742 583.380671
17 57 0 0.849025 0.113289 1.270080 0.491037 -35.894642 3.474426 0.034735 0.400185 14.512768 5.477708 0.141803 22.479055 1.690098 6.031056 927.063276
18 57 0 0.885479 0.168550 1.021275 0.542347 -32.017422 3.248811 0.032800 0.309031 23.433559 5.230515 0.085109 20.074008 -11.126546 11.136041 930.223353
19 40 0 0.957065 0.166342 1.445147 0.842695 -30.476837 2.687244 0.044030 0.692408 17.079100 3.237483 0.174098 20.023235 1.282640 7.872279 985.160918
In [31]:
# cross-tabulate Sex against the diagnosis
pd.crosstab(x.Sex,y)
Out[31]:
Diagnosis (ALS) 0 1
Sex
0 13 17
1 20 14
In [32]:
# cross-tabulate Age against Sex
pd.crosstab(x.Age,x.Sex)
Out[32]:
Sex 0 1
Age
34 1 0
35 1 0
37 0 1
38 2 0
39 0 2
40 1 1
41 1 0
43 1 0
45 0 1
49 1 0
50 0 3
51 1 0
52 1 1
53 0 2
55 0 3
57 3 2
58 3 1
59 0 1
60 3 3
61 2 0
62 0 2
63 1 3
64 0 3
65 0 1
66 1 0
67 3 2
68 2 1
69 1 0
70 0 1
80 1 0
In [33]:
# count of speakers at each age, shown with a countplot
plt.figure(figsize=(12, 8))

age = sns.countplot(x='Age', data=x)

for i in age.containers:
    age.bar_label(i)

The loop iterates over each bar container in the plot and labels every bar with its count.
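A minimal bar_label example with synthetic bars (using the 33 HC / 31 ALS speaker counts from the dataset description) shows what the loop does:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
bars = ax.bar(['HC', 'ALS'], [33, 31])
labels = ax.bar_label(bars)            # annotate each bar with its height
print([t.get_text() for t in labels])  # → ['33', '31']
```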

In [35]:
# distribution of the sexes, shown with a countplot
plt.figure(figsize=(12, 8))

Sex = sns.countplot(x='Sex', data=x, hue='Sex')

for i in Sex.containers:
    Sex.bar_label(i)
    

The code snippet creates a count plot of the 'Sex' column of the DataFrame x using the seaborn library. The code:

  1. Creates a count plot for the 'Sex' column, with bars colored by the values in 'Sex'.
  2. Iterates over the bar containers in the plot.
  3. Labels each bar with its count.

The resulting plot shows two bars, one for each unique value in the 'Sex' column, with heights equal to the number of observations of each sex.
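The same counts can be read off numerically with value_counts; a toy series for illustration, encoded 0/1 as above:

```python
import pandas as pd

sex = pd.Series([0, 1, 1, 0, 1], name='Sex')  # 0 = male, 1 = female
counts = sex.value_counts()
print(counts[1], counts[0])  # → 3 2
```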

Histogram plots

In [38]:
# histogram of each selected column, annotated with skewness and kurtosis
for column in x:
    plt.figure(figsize=(10, 6))
    k = x[column].kurt()  # kurtosis
    s = x[column].skew()  # skewness
    sns.histplot(x[column], kde=True, label=f'kurtosis: {k:.2f} & skewness: {s:.2f}')
    plt.title(f'Histogram of {column}')
    plt.legend()
    plt.show()

The code creates a histogram for each column of the DataFrame x using the seaborn library. The code:

  • Iterates over each column in the DataFrame
  • Calculates the kurtosis and skewness of each column
  • Draws a histogram with a KDE curve overlaid, reporting the kurtosis and skewness in the legend
  • Titles each plot with the column name

The resulting plots show the distribution of values in each column, including any outliers or anomalies.
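What the two annotations mean can be checked on a tiny hand-made sample: a single large value on the right drags both statistics positive.

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 10])  # one high outlier on the right
print(s.skew() > 0)  # right tail → positive skewness
print(s.kurt() > 0)  # heavy tail → positive excess kurtosis
```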

In [40]:
x.hist(figsize=(18,12)); # histogram of all columns

Box plots

In [42]:
# boxplots for the numerical columns, to spot outliers
plt.figure(figsize=(10, 8))
for i, column in enumerate(x):
    plt.subplot(4, 5, i + 1)
    sns.boxplot(x=x[column])
    plt.title(f'Boxplot of {column}')
        
plt.tight_layout()
plt.show()
  1. Iterate over each column in the DataFrame using enumerate.
  2. Create a box plot for each column using sns.boxplot.
  3. Set the title of each plot to the column name.

Resulting Plot

  • 17 subplots on a 4 × 5 grid, each showing a box plot for one selected column.
  • Each subplot title is the name of the corresponding column.

Useful for

  • Exploring and understanding the distribution of values in each column.
  • Identifying outliers or anomalies in the data.
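The boxplot whiskers follow the usual 1.5 × IQR convention; the same outlier test can be sketched directly in pandas (toy values for illustration):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1

# anything beyond 1.5 * IQR from the quartiles is flagged as an outlier
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(list(outliers))  # → [100]
```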
In [44]:
# boxplot of all columns in one figure
x.boxplot(figsize=(18,6))
Out[44]:
<Axes: >
In [45]:
# create dictionaries to hold the statistics
# (col_max / col_min avoid shadowing the built-in max and min)
mean = {}
std = {}
col_max = {}
col_min = {}
var = {}
mode = {}


# calculate and print the statistics for each column
for i in x.columns:
    mean[i] = x[i].mean()
    std[i] = x[i].std()
    col_max[i] = x[i].max()
    col_min[i] = x[i].min()
    var[i] = x[i].var()
    mode[i] = x[i].mode()

    print(f"Column: {i}")
    print(f"Mean: {mean[i]}")
    print(f"Standard Deviation: {std[i]}")
    print(f"Max: {col_max[i]}")
    print(f"Min: {col_min[i]}")
    print(f"Var: {var[i]}")
    print(f"Mode: {mode[i]}")
    print('-----' * 10)
Column: Age
Mean: 56.390625
Standard Deviation: 10.203667544035643
Max: 80
Min: 34
Var: 104.11483134920636
Mode: 0    60
Name: Age, dtype: int64
--------------------------------------------------
Column: Sex
Mean: 0.53125
Standard Deviation: 0.5029673851018478
Max: 1
Min: 0
Var: 0.25297619047619047
Mode: 0    1
Name: Sex, dtype: int64
--------------------------------------------------
Column: J55_a
Mean: 0.9454963659113814
Standard Deviation: 0.7915578601815897
Max: 5.99133551977937
Min: 0.285496902227754
Var: 0.6265638460152572
Mode: 0     0.285497
1     0.304952
2     0.347673
3     0.363699
4     0.378197
        ...   
59    1.821009
60    1.889785
61    1.951256
62    1.965493
63    5.991336
Name: J55_a, Length: 64, dtype: float64
--------------------------------------------------
Column: PFR_a
Mean: 0.22950795241187177
Standard Deviation: 0.2994210517717801
Max: 2.05988326586503
Min: 0.0614092277753447
Var: 0.08965296624411902
Mode: 0     0.061409
1     0.063347
2     0.063731
3     0.064732
4     0.066136
        ...   
59    0.366096
60    0.394053
61    1.056616
62    1.195964
63    2.059883
Name: PFR_a, Length: 64, dtype: float64
--------------------------------------------------
Column: PPE_a
Mean: 1.04610081933792
Standard Deviation: 0.663761326486491
Max: 3.39377357252711
Min: 0.0397935998137785
Var: 0.44057909853910604
Mode: 0     0.039794
1     0.060351
2     0.118447
3     0.187907
4     0.261817
        ...   
59    2.034901
60    2.143594
61    2.377804
62    2.401591
63    3.393774
Name: PPE_a, Length: 64, dtype: float64
--------------------------------------------------
Column: PVI_a
Mean: 0.48633739351358785
Standard Deviation: 0.44774998289264556
Max: 3.152777509126
Min: -0.0
Var: 0.20048004718036438
Mode: 0   -0.0
Name: PVI_a, dtype: float64
--------------------------------------------------
Column: Ha(8)_{mu}
Mean: -32.58485280214021
Standard Deviation: 13.539554443493675
Max: -5.26183857168348
Min: -64.0606439484734
Var: 183.31953452832929
Mode: 0    -64.060644
1    -63.776113
2    -60.830922
3    -56.807558
4    -55.626682
        ...    
59   -12.952667
60   -12.464765
61   -12.270083
62   -10.431235
63    -5.261839
Name: Ha(8)_{mu}, Length: 64, dtype: float64
--------------------------------------------------
Column: Ha(7)_{sd}
Mean: 4.992635204732386
Standard Deviation: 3.3015583269477715
Max: 20.8839801187164
Min: 1.25072855119363
Var: 10.900287386238169
Mode: 0      1.250729
1      1.405418
2      1.538601
3      1.542359
4      1.606694
        ...    
59     9.196782
60     9.681702
61    10.546459
62    11.902175
63    20.883980
Name: Ha(7)_{sd}, Length: 64, dtype: float64
--------------------------------------------------
Column: Ha(7)_{rel}
Mean: 0.038140245917599305
Standard Deviation: 0.019359151877006125
Max: 0.100367151246796
Min: 0.0148963624373277
Var: 0.00037477676139698973
Mode: 0     0.014896
1     0.015150
2     0.015293
3     0.015833
4     0.016181
        ...   
59    0.073330
60    0.078140
61    0.084842
62    0.088354
63    0.100367
Name: Ha(7)_{rel}, Length: 64, dtype: float64
--------------------------------------------------
Column: PVI_i
Mean: 0.5162929162086269
Standard Deviation: 0.515401293055496
Max: 3.01306679375731
Min: -0.0
Var: 0.26563849288327723
Mode: 0    -0.000000
1     0.122561
2     0.143140
3     0.145695
4     0.175327
        ...   
59    1.272719
60    1.333404
61    1.393756
62    2.739748
63    3.013067
Name: PVI_i, Length: 64, dtype: float64
--------------------------------------------------
Column: HNR_i
Mean: 19.60951260780632
Standard Deviation: 4.9312672149769
Max: 29.0340710410429
Min: 6.83156778933749
Var: 24.317396345506033
Mode: 0      6.831568
1      8.934965
2     10.116282
3     10.299747
4     11.798070
        ...    
59    27.316035
60    27.509402
61    27.958510
62    28.268906
63    29.034071
Name: HNR_i, Length: 64, dtype: float64
--------------------------------------------------
Column: Hi(8)_{sd}
Mean: 5.224679280549857
Standard Deviation: 2.168108949508015
Max: 11.1326713190622
Min: 1.8687080560454
Var: 4.700696416936749
Mode: 0      1.868708
1      1.891246
2      1.973133
3      2.170203
4      2.384895
        ...    
59     8.683996
60     9.199550
61     9.297256
62    10.915085
63    11.132671
Name: Hi(8)_{sd}, Length: 64, dtype: float64
--------------------------------------------------
Column: Hi(1)_{rel}
Mean: 0.15205975398202076
Standard Deviation: 0.11088728740013579
Max: 0.416235733630038
Min: 0.0181452927589843
Var: 0.012295990506960312
Mode: 0     0.018145
1     0.018648
2     0.019288
3     0.029578
4     0.031000
        ...   
59    0.367825
60    0.395132
61    0.396101
62    0.398835
63    0.416236
Name: Hi(1)_{rel}, Length: 64, dtype: float64
--------------------------------------------------
Column: CCi(2)
Mean: 12.668709484671439
Standard Deviation: 8.580686864222983
Max: 26.2731092461923
Min: -14.8975515227695
Var: 73.62818706184883
Mode: 0    -14.897552
1     -9.525669
2     -9.033026
3     -4.440639
4     -2.032011
        ...    
59    22.955553
60    23.228614
61    24.048123
62    24.276828
63    26.273109
Name: CCi(2), Length: 64, dtype: float64
--------------------------------------------------
Column: CCi(6)
Mean: -7.65742962711239
Standard Deviation: 8.095678597262246
Max: 8.93262955204458
Min: -21.7832948089145
Var: 65.54001195017001
Mode: 0    -21.783295
1    -19.547338
2    -19.498076
3    -19.301494
4    -19.212601
        ...    
59     6.137819
60     6.707502
61     7.383212
62     8.327992
63     8.932630
Name: CCi(6), Length: 64, dtype: float64
--------------------------------------------------
Column: d_1
Mean: 9.164472684963322
Standard Deviation: 2.6814486649289866
Max: 15.4207766781969
Min: 2.27670168162266
Var: 7.190166942649444
Mode: 0      2.276702
1      2.512995
2      2.986929
3      4.825476
4      5.218871
        ...    
59    12.862700
60    12.874560
61    12.892692
62    14.651111
63    15.420777
Name: d_1, Length: 64, dtype: float64
--------------------------------------------------
Column: F2_{conv}
Mean: 1209.9764046384803
Standard Deviation: 553.6940461956145
Max: 2441.21905442786
Min: 48.2462034430127
Var: 306577.09679247136
Mode: 0       48.246203
1      177.843734
2      359.409974
3      481.009629
4      482.819916
         ...     
59    2157.871393
60    2210.936432
61    2226.127951
62    2284.051658
63    2441.219054
Name: F2_{conv}, Length: 64, dtype: float64
--------------------------------------------------
In [ ]:
 
In [46]:
num_col = x.select_dtypes(include='number').columns # select the numeric columns
num_col = x[num_col].corr() 


# check the correlation between the numeric columns
num_col.head(50)
Out[46]:
Age Sex J55_a PFR_a PPE_a PVI_a Ha(8)_{mu} Ha(7)_{sd} Ha(7)_{rel} PVI_i HNR_i Hi(8)_{sd} Hi(1)_{rel} CCi(2) CCi(6) d_1 F2_{conv}
Age 1.000000 0.026966 0.184253 0.136052 0.049242 0.154727 -0.104500 0.186713 -0.197313 -0.001773 -0.291332 0.158820 -0.229568 -0.399001 0.120964 -0.241235 -0.341549
Sex 0.026966 1.000000 -0.214536 -0.163947 -0.232222 -0.220055 -0.287087 -0.013304 -0.133471 0.096510 0.308280 0.129129 0.235512 -0.174748 -0.257726 0.167690 0.306281
J55_a 0.184253 -0.214536 1.000000 0.817227 0.763242 0.861297 -0.179818 0.669413 -0.195084 0.245381 -0.452905 0.208691 -0.289631 -0.224258 0.388139 -0.211375 -0.209832
PFR_a 0.136052 -0.163947 0.817227 1.000000 0.456137 0.686740 -0.209699 0.651039 -0.237686 0.011550 -0.283053 0.137311 -0.244529 -0.403757 0.381911 -0.285186 -0.331199
PPE_a 0.049242 -0.232222 0.763242 0.456137 1.000000 0.737692 -0.127952 0.551722 -0.174568 0.541317 -0.496916 0.320599 -0.372266 0.000599 0.429114 -0.189690 -0.068347
PVI_a 0.154727 -0.220055 0.861297 0.686740 0.737692 1.000000 -0.145881 0.606170 -0.173399 0.316831 -0.457349 0.196170 -0.257760 -0.207569 0.449964 -0.261770 -0.209879
Ha(8)_{mu} -0.104500 -0.287087 -0.179818 -0.209699 -0.127952 -0.145881 1.000000 -0.453556 0.712990 -0.084679 -0.112767 -0.485238 0.023020 0.398305 0.133397 0.138015 0.093605
Ha(7)_{sd} 0.186713 -0.013304 0.669413 0.651039 0.551722 0.606170 -0.453556 1.000000 -0.532977 0.204290 -0.260853 0.396057 -0.301603 -0.383573 0.223843 -0.284878 -0.199028
Ha(7)_{rel} -0.197313 -0.133471 -0.195084 -0.237686 -0.174568 -0.173399 0.712990 -0.532977 1.000000 -0.140939 0.098513 -0.477790 0.123904 0.398123 -0.059114 0.359289 0.285354
PVI_i -0.001773 0.096510 0.245381 0.011550 0.541317 0.316831 -0.084679 0.204290 -0.140939 1.000000 -0.347194 0.298143 -0.290207 -0.022991 0.057376 -0.125869 0.029502
HNR_i -0.291332 0.308280 -0.452905 -0.283053 -0.496916 -0.457349 -0.112767 -0.260853 0.098513 -0.347194 1.000000 -0.185386 0.599915 0.238966 -0.389983 0.345107 0.294596
Hi(8)_{sd} 0.158820 0.129129 0.208691 0.137311 0.320599 0.196170 -0.485238 0.396057 -0.477790 0.298143 -0.185386 1.000000 -0.208470 -0.240840 0.097976 -0.261030 -0.048364
Hi(1)_{rel} -0.229568 0.235512 -0.289631 -0.244529 -0.372266 -0.257760 0.023020 -0.301603 0.123904 -0.290207 0.599915 -0.208470 1.000000 0.269096 -0.109771 0.069119 0.281251
CCi(2) -0.399001 -0.174748 -0.224258 -0.403757 0.000599 -0.207569 0.398305 -0.383573 0.398123 -0.022991 0.238966 -0.240840 0.269096 1.000000 -0.197918 0.541387 0.564874
CCi(6) 0.120964 -0.257726 0.388139 0.381911 0.429114 0.449964 0.133397 0.223843 -0.059114 0.057376 -0.389983 0.097976 -0.109771 -0.197918 1.000000 -0.559611 -0.372608
d_1 -0.241235 0.167690 -0.211375 -0.285186 -0.189690 -0.261770 0.138015 -0.284878 0.359289 -0.125869 0.345107 -0.261030 0.069119 0.541387 -0.559611 1.000000 0.498940
F2_{conv} -0.341549 0.306281 -0.209832 -0.331199 -0.068347 -0.209879 0.093605 -0.199028 0.285354 0.029502 0.294596 -0.048364 0.281251 0.564874 -0.372608 0.498940 1.000000
In [ ]:
 
In [47]:
plt.figure (figsize=(12,10))
sns.heatmap(data=num_col,annot=True,fmt='.2g',linewidths=1,cmap='coolwarm', square=True) 
plt.tight_layout()
[correlation heatmap of the numeric features]
In [48]:
plt.figure(figsize=(18,18)) # note: sns.pairplot creates its own figure, so this call only produces the empty Figure shown below
imp_col = df[['Age', 'Sex', 'J55_a', 'PFR_a', 'PPE_a', 'PVI_a', 'Ha(8)_{mu}', 'Ha(7)_{sd}','Ha(7)_{rel}', 'PVI_i', 'HNR_i', 'Hi(8)_{sd}', 'Hi(1)_{rel}','CCi(2)', 'CCi(6)', 'd_1', 'F2_{conv}','Diagnosis (ALS)']]
sns.pairplot(data=imp_col,hue='Diagnosis (ALS)')
plt.show()
<Figure size 1800x1800 with 0 Axes>
[pairplot of the selected features, colored by Diagnosis (ALS)]
In [ ]:
 
In [ ]:
 

Model building using a for loop

In [50]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=12,stratify=y)

Splitting a Dataset

  • The code uses train_test_split from scikit-learn to split a dataset into training and testing sets.
  • The arguments are:
    • x and y: input features and target variables, respectively.
    • test_size=0.2: 20% of data for testing, 80% for training.
    • random_state=12: ensures reproducibility of the split.
    • stratify=y: ensures class proportions in training and testing sets match the original dataset.
  • The function returns four arrays:
    • x_train: training features (80%).
    • x_test: testing features (20%).
    • y_train: training target variable.
    • y_test: testing target variable.
  • This process lets us evaluate model performance on unseen data and detect overfitting.
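As a quick illustration of the stratify argument (toy data, not the notebook's ALS features), both splits keep exactly the same class proportion:

```python
# Toy illustration (hypothetical data, not the ALS dataset): stratify=y
# keeps the class proportions identical in the train and test splits.
import numpy as np
from sklearn.model_selection import train_test_split

y_demo = np.array([0] * 40 + [1] * 40)   # perfectly balanced toy labels
X_demo = np.arange(80).reshape(-1, 1)    # dummy feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=12, stratify=y_demo)

print(y_tr.mean(), y_te.mean())  # both splits keep a 50/50 class ratio
```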
In [52]:
# create a list of classifiers to evaluate
# (ensure the estimator classes are imported)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              BaggingClassifier, AdaBoostClassifier)
from sklearn.neighbors import KNeighborsClassifier

model = [
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier(),
    BaggingClassifier(),
    AdaBoostClassifier(),
    LogisticRegression(),
    SVC(),
    KNeighborsClassifier(),
    GaussianNB()
]
In [ ]:
 

KFOLD

In [54]:
from sklearn.model_selection import KFold  # ensure KFold is available

kfold = KFold(8)
for i in model:
    score = cross_val_score(estimator=i,X=x,y=y,cv=kfold)
    print(i)
    print('score:',score)
    print('mean_value:',score.mean())
    print('min_value:',score.min())
    print('------'*10)
DecisionTreeClassifier()
score: [0.625 0.875 0.75  0.625 0.5   0.625 1.    0.875]
mean_value: 0.734375
min_value: 0.5
------------------------------------------------------------
RandomForestClassifier()
score: [0.5   0.75  1.    0.625 1.    0.5   1.    0.625]
mean_value: 0.75
min_value: 0.5
------------------------------------------------------------
GradientBoostingClassifier()
score: [0.625 0.875 0.5   0.625 0.375 0.625 1.    0.75 ]
mean_value: 0.671875
min_value: 0.375
------------------------------------------------------------
BaggingClassifier()
score: [0.625 0.75  0.875 0.625 0.625 0.625 1.    0.625]
mean_value: 0.71875
min_value: 0.625
------------------------------------------------------------
AdaBoostClassifier()
score: [0.625 0.875 0.75  0.75  0.5   0.75  1.    0.375]
mean_value: 0.703125
min_value: 0.375
------------------------------------------------------------
LogisticRegression()
score: [0.625 0.75  0.5   0.25  1.    0.5   0.875 0.375]
mean_value: 0.609375
min_value: 0.25
------------------------------------------------------------
SVC()
score: [0.25  0.625 0.5   0.25  0.625 0.5   0.75  0.5  ]
mean_value: 0.5
min_value: 0.25
------------------------------------------------------------
KNeighborsClassifier()
score: [0.375 0.25  0.875 0.375 0.625 0.625 0.375 0.25 ]
mean_value: 0.46875
min_value: 0.25
------------------------------------------------------------
GaussianNB()
score: [0.5   0.75  0.875 0.625 1.    0.75  1.    0.75 ]
mean_value: 0.78125
min_value: 0.5
------------------------------------------------------------


  • Plain KFold is not a good fit for this dataset
  • The fold scores vary widely (minimums as low as 0.25) because plain KFold does not keep the ALS/HC classes balanced across folds
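The instability is easier to see on toy labels (the ratio below is made up, not this dataset's near-balanced split): plain KFold slices the data in order, so a fold can end up containing a single class, while StratifiedKFold balances every fold.

```python
# Sketch with made-up labels: KFold's sequential folds can be single-class,
# StratifiedKFold keeps each fold close to the overall class ratio.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y_demo = np.array([0] * 10 + [1] * 54)   # imbalanced toy labels
X_demo = np.zeros((64, 1))

kf_ratios = [y_demo[test].mean() for _, test in KFold(8).split(X_demo)]
skf_ratios = [y_demo[test].mean()
              for _, test in StratifiedKFold(8).split(X_demo, y_demo)]

print(kf_ratios)   # first fold is all class 0, later folds are all class 1
print(skf_ratios)  # every fold mixes both classes
```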
In [ ]:
 

STRATIFIEDKFOLD

In [57]:
from sklearn.model_selection import StratifiedKFold  # ensure StratifiedKFold is available

skf = StratifiedKFold(5)
for i in model:
    score2 = cross_val_score(estimator=i,X=x,y=y,cv=skf)
    print(i)
    print('score:',score2)
    print('mean_value:',score2.mean())
    print('min_value:',score2.min())
    print('------'*10)
DecisionTreeClassifier()
score: [0.61538462 0.61538462 0.84615385 0.92307692 0.66666667]
mean_value: 0.7333333333333333
min_value: 0.6153846153846154
------------------------------------------------------------
RandomForestClassifier()
score: [0.76923077 0.61538462 0.92307692 1.         0.75      ]
mean_value: 0.8115384615384615
min_value: 0.6153846153846154
------------------------------------------------------------
GradientBoostingClassifier()
score: [0.76923077 0.61538462 0.92307692 1.         0.83333333]
mean_value: 0.8282051282051281
min_value: 0.6153846153846154
------------------------------------------------------------
BaggingClassifier()
score: [0.69230769 0.61538462 0.84615385 1.         0.91666667]
mean_value: 0.8141025641025641
min_value: 0.6153846153846154
------------------------------------------------------------
AdaBoostClassifier()
score: [0.69230769 0.76923077 0.84615385 0.76923077 0.66666667]
mean_value: 0.7487179487179487
min_value: 0.6666666666666666
------------------------------------------------------------
LogisticRegression()
score: [0.69230769 0.76923077 0.53846154 0.76923077 0.66666667]
mean_value: 0.6871794871794872
min_value: 0.5384615384615384
------------------------------------------------------------
SVC()
score: [0.84615385 0.84615385 0.76923077 0.69230769 0.5       ]
mean_value: 0.7307692307692308
min_value: 0.5
------------------------------------------------------------
KNeighborsClassifier()
score: [0.84615385 0.76923077 0.69230769 0.61538462 0.5       ]
mean_value: 0.6846153846153846
min_value: 0.5
------------------------------------------------------------
GaussianNB()
score: [0.76923077 0.69230769 0.92307692 0.92307692 0.75      ]
mean_value: 0.8115384615384617
min_value: 0.6923076923076923
------------------------------------------------------------
  • StratifiedKFold is an improvement, though the scores are still uneven across folds
  • Five models stand out at this stage:
  • DecisionTreeClassifier mean value: 0.73
  • RandomForestClassifier mean value: 0.81
  • GradientBoostingClassifier mean value: 0.83
  • BaggingClassifier mean value: 0.81
  • AdaBoostClassifier mean value: 0.75

In [ ]:
 

LeaveOneOut

In [60]:
from sklearn.model_selection import LeaveOneOut  # ensure LeaveOneOut is available

LOO = LeaveOneOut()
for i in model:
    score3 = cross_val_score(estimator=i,X=x,y=y,cv=LOO)
    print(i)
    print('score:',score3)
    print('mean_value:',score3.mean())
    print('------'*10)
DecisionTreeClassifier()
score: [1. 1. 0. 1. 0. 0. 0. 0. 1. 0. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 0. 1.
 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0.]
mean_value: 0.703125
------------------------------------------------------------
RandomForestClassifier()
score: [1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0.]
mean_value: 0.796875
------------------------------------------------------------
GradientBoostingClassifier()
score: [1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.]
mean_value: 0.828125
------------------------------------------------------------
BaggingClassifier()
score: [1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0.]
mean_value: 0.78125
------------------------------------------------------------
AdaBoostClassifier()
score: [1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0.]
mean_value: 0.8125
------------------------------------------------------------
LogisticRegression()
score: [1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 0.
 1. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 1.
 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 0.]
mean_value: 0.734375
------------------------------------------------------------
SVC()
score: [1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 1. 0. 0. 1. 1.]
mean_value: 0.71875
------------------------------------------------------------
KNeighborsClassifier()
score: [1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0.
 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1.
 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1.]
mean_value: 0.640625
------------------------------------------------------------
GaussianNB()
score: [1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1.
 0. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0.]
mean_value: 0.796875
------------------------------------------------------------


  • Leave-One-Out also works well for these models
  • Four models stand out here:
  • RandomForestClassifier mean value: 0.80
  • GradientBoostingClassifier mean value: 0.83
  • BaggingClassifier mean value: 0.78
  • AdaBoostClassifier mean value: 0.81
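Leave-One-Out is simply k-fold with k equal to the number of samples, which is why each score array above has 64 entries (one 0/1 hit per held-out phonation). A minimal check:

```python
# LeaveOneOut produces one train/test split per sample.
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_demo = np.zeros((64, 1))               # same sample count as this dataset
n_loo = LeaveOneOut().get_n_splits(X_demo)
print(n_loo)  # 64 splits, one per sample
```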
In [ ]:
 

ShuffleSplit

In [63]:
from sklearn.model_selection import ShuffleSplit  # ensure ShuffleSplit is available

ssp = ShuffleSplit(n_splits=6,test_size=0.2)
for i in model:
    score4 = cross_val_score(estimator=i,X=x,y=y,cv=ssp)
    print(i)
    print('score:',score4)
    print('mean_value:',score4.mean())
    print('min_value:',score4.min())
    print('------'*10)
DecisionTreeClassifier()
score: [0.84615385 0.76923077 0.84615385 0.76923077 0.61538462 0.76923077]
mean_value: 0.7692307692307693
min_value: 0.6153846153846154
------------------------------------------------------------
RandomForestClassifier()
score: [0.76923077 0.76923077 0.84615385 0.84615385 0.84615385 0.92307692]
mean_value: 0.8333333333333334
min_value: 0.7692307692307693
------------------------------------------------------------
GradientBoostingClassifier()
score: [0.61538462 0.69230769 0.76923077 0.61538462 0.84615385 0.84615385]
mean_value: 0.7307692307692308
min_value: 0.6153846153846154
------------------------------------------------------------
BaggingClassifier()
score: [0.76923077 0.76923077 0.92307692 0.92307692 0.69230769 0.84615385]
mean_value: 0.8205128205128206
min_value: 0.6923076923076923
------------------------------------------------------------
AdaBoostClassifier()
score: [0.84615385 0.92307692 0.84615385 0.84615385 0.76923077 0.84615385]
mean_value: 0.8461538461538461
min_value: 0.7692307692307693
------------------------------------------------------------
LogisticRegression()
score: [0.76923077 0.69230769 0.53846154 0.61538462 0.69230769 0.61538462]
mean_value: 0.6538461538461539
min_value: 0.5384615384615384
------------------------------------------------------------
SVC()
score: [0.92307692 0.76923077 0.76923077 0.69230769 0.53846154 0.76923077]
mean_value: 0.7435897435897436
min_value: 0.5384615384615384
------------------------------------------------------------
KNeighborsClassifier()
score: [0.76923077 0.69230769 0.69230769 0.61538462 0.61538462 0.61538462]
mean_value: 0.6666666666666666
min_value: 0.6153846153846154
------------------------------------------------------------
GaussianNB()
score: [0.76923077 0.84615385 0.76923077 0.76923077 0.92307692 0.84615385]
mean_value: 0.8205128205128204
min_value: 0.7692307692307693
------------------------------------------------------------


  • ShuffleSplit also works well for these models
  • The strongest mean values here are:
  • RandomForestClassifier mean value: 0.83
  • AdaBoostClassifier mean value: 0.85
  • BaggingClassifier mean value: 0.82
  • GaussianNB mean value: 0.82
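One caveat worth noting: ShuffleSplit draws a fresh random test set on every split, and no random_state is fixed above, so these scores change between runs. A small sketch of its behaviour:

```python
# ShuffleSplit resamples the test set each iteration; fixing random_state
# makes the six splits reproducible. Test size is ceil(0.2 * 64) = 13.
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_demo = np.arange(64).reshape(-1, 1)
ssp_demo = ShuffleSplit(n_splits=6, test_size=0.2, random_state=12)
sizes = [len(test) for _, test in ssp_demo.split(X_demo)]
print(sizes)  # six independent test sets of 13 samples each
```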
In [65]:
# create (name, estimator) pairs for hyperparameter tuning
models = [
    
    ('DecisionTreeClassifier',DecisionTreeClassifier()),
    ('RandomForestClassifier', RandomForestClassifier()),
    ('GradientBoostingClassifier', GradientBoostingClassifier()),
    ('BaggingClassifier',BaggingClassifier()),
    ('AdaBoostClassifier',AdaBoostClassifier()),
    ('LogisticRegression',LogisticRegression()),
    ('SVC',SVC()),
    ('KNeighborsClassifier',KNeighborsClassifier()),
    ('GaussianNB',GaussianNB())
]
In [66]:
# create parameter grids for GridSearchCV
para_grids= {
    'DecisionTreeClassifier':{
        'criterion':['gini','entropy','log_loss'],
        'splitter':['best','random'],
        'max_depth':[6,8,10,12],
        'min_samples_split':[2,3,4,5,6]
    },
    'RandomForestClassifier':{
       'n_estimators':[10,40,50,60,70,100],
        'criterion':['gini','entropy','log_loss'],
        'max_depth':[6,8,10,12],
        'min_samples_split':[2,3,4,5,6]
    },
    'GradientBoostingClassifier':{
        'n_estimators':[10,40,50,60,70,100],
        'min_samples_split':[2,3,4,5,6]
    },
    'BaggingClassifier':{
        'n_estimators':[10,40,50,60,70,100],
        'max_samples':[6,8,10,12]
    },
    'AdaBoostClassifier':{
        'n_estimators':[10,40,50,60,70,100]
    },
    'LogisticRegression':{
    },
    'SVC':{
    },
    'KNeighborsClassifier':{
        'n_neighbors':[2,3,4,5],
        'weights':['distance','uniform'],
        'algorithm':['ball_tree','kd_tree','brute'],
        'p':[2,3,4,5,6]
    },
    'GaussianNB':{}
}
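For a sense of scale, GridSearchCV will fit every combination in each grid once per fold. The DecisionTreeClassifier grid above expands to 3 × 2 × 4 × 5 = 120 candidates:

```python
# Counting the candidates GridSearchCV will evaluate for the decision tree grid.
from sklearn.model_selection import ParameterGrid

dt_grid = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'splitter': ['best', 'random'],
    'max_depth': [6, 8, 10, 12],
    'min_samples_split': [2, 3, 4, 5, 6],
}
n_candidates = len(ParameterGrid(dt_grid))
print(n_candidates)      # 120 parameter combinations
print(n_candidates * 5)  # 600 model fits with cv=5
```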
In [ ]:
 

GridSearchCV

In [68]:
for model_name,model_instance in models:
    
    Gscv = GridSearchCV(estimator=model_instance,param_grid=para_grids[model_name],cv=5,n_jobs=-1)
    
    # Fit the model
    Gscv.fit(x_train,y_train)
    
    
    # Make predictions
    pred = Gscv.predict(x_test)
    
    # Print classification report
    # Get the best estimator
    print(Gscv.best_estimator_)
    print(Gscv.best_score_)
    print('------'*10)
    
DecisionTreeClassifier(criterion='entropy', max_depth=10, splitter='random')
0.8836363636363636
------------------------------------------------------------
RandomForestClassifier(max_depth=6, n_estimators=40)
0.8636363636363636
------------------------------------------------------------
GradientBoostingClassifier(n_estimators=10)
0.7836363636363636
------------------------------------------------------------
BaggingClassifier(max_samples=12, n_estimators=60)
0.8036363636363635
------------------------------------------------------------
AdaBoostClassifier()
0.7854545454545454
------------------------------------------------------------
LogisticRegression()
0.630909090909091
------------------------------------------------------------
SVC()
0.6890909090909091
------------------------------------------------------------
KNeighborsClassifier(algorithm='ball_tree', p=3)
0.7054545454545453
------------------------------------------------------------
GaussianNB()
0.7836363636363636
------------------------------------------------------------


  • Based on the cross-validated best scores:
  • Four models fit this dataset best:
  • DecisionTreeClassifier best score: 0.88
  • RandomForestClassifier best score: 0.86
  • GradientBoostingClassifier best score: 0.78
  • BaggingClassifier best score: 0.80
In [ ]:
 
In [ ]:
 

RandomizedSearchCV

In [71]:
for model_name,model_instance in models:
    
    rscv = RandomizedSearchCV(estimator=model_instance,param_distributions=para_grids[model_name],cv=5,n_jobs=-1)
    
    # Fit the model
    rscv.fit(x_train,y_train)
    
    
    # Make predictions
    pred = rscv.predict(x_test)
    
   
    # Get the best estimator
    print(rscv.best_estimator_)
    #  get Print best_score
    print(rscv.best_score_)
    print('------'*10)
    
DecisionTreeClassifier(criterion='log_loss', max_depth=8)
0.8054545454545454
------------------------------------------------------------
RandomForestClassifier(criterion='entropy', max_depth=12, min_samples_split=4,
                       n_estimators=60)
0.8236363636363636
------------------------------------------------------------
GradientBoostingClassifier(n_estimators=10)
0.7836363636363636
------------------------------------------------------------
BaggingClassifier(max_samples=6, n_estimators=100)
0.8254545454545456
------------------------------------------------------------
AdaBoostClassifier()
0.7854545454545454
------------------------------------------------------------
LogisticRegression()
0.630909090909091
------------------------------------------------------------
SVC()
0.6890909090909091
------------------------------------------------------------
KNeighborsClassifier(algorithm='kd_tree', p=4)
0.7054545454545453
------------------------------------------------------------
GaussianNB()
0.7836363636363636
------------------------------------------------------------


  • Based on the cross-validated best scores:
  • Five models fit this dataset best:
  • DecisionTreeClassifier best score: 0.81
  • AdaBoostClassifier best score: 0.79
  • RandomForestClassifier best score: 0.82
  • GradientBoostingClassifier best score: 0.78
  • BaggingClassifier best score: 0.83
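The gap to GridSearchCV is expected: RandomizedSearchCV samples only n_iter candidates (10 by default) instead of exhausting the grid. For the RandomForestClassifier grid above that means 10 of 6 × 3 × 4 × 5 = 360 combinations:

```python
# RandomizedSearchCV draws a random subset of the grid rather than all of it.
from sklearn.model_selection import ParameterSampler

rf_grid = {
    'n_estimators': [10, 40, 50, 60, 70, 100],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [6, 8, 10, 12],
    'min_samples_split': [2, 3, 4, 5, 6],
}
sampled = list(ParameterSampler(rf_grid, n_iter=10, random_state=12))
print(len(sampled))  # 10 candidates tried out of 360 possible
```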
In [ ]:
 

I am selecting the RandomForestClassifier after comparing the best scores of all the models

In [74]:
model =RandomForestClassifier(max_depth=6, min_samples_split=6, n_estimators=10)
In [ ]:
 

KFOLD

In [76]:
kfold = KFold(6)
score = cross_val_score(estimator=model,X=x,y=y,n_jobs=-1,cv=kfold,verbose=0) # pass the KFold object, not a plain integer
In [77]:
#kfold Scores
score
Out[77]:
array([0.81818182, 0.90909091, 0.72727273, 0.90909091, 0.9       ,
       0.6       ])
In [78]:
# mean of the Score
score.mean()
Out[78]:
0.8106060606060606

StratifiedKFold

In [80]:
SKF = StratifiedKFold(6)
score = cross_val_score(estimator=model,X=x,y=y,cv=SKF,n_jobs=-1,verbose=0) # pass the StratifiedKFold object, not a plain integer
In [81]:
score
Out[81]:
array([0.72727273, 1.        , 0.63636364, 0.81818182, 0.9       ,
       0.7       ])
In [82]:
# mean of the Score
score.mean()
Out[82]:
0.796969696969697
In [ ]:
 

LeaveOneOut

In [84]:
LOO = LeaveOneOut()
score = cross_val_score(estimator=model,X=x,y=y,cv=LOO,verbose=0)
In [85]:
score
Out[85]:
array([1., 1., 1., 1., 0., 0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 0.,
       1., 1., 1., 1., 0., 1., 1., 1., 1., 0., 1., 0., 0., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 0., 0.])
In [86]:
# mean of the Score
score.mean()
Out[86]:
0.765625
In [ ]:
 

ShuffleSplit

In [88]:
SFP= ShuffleSplit(n_splits=6,test_size=0.24)
score = cross_val_score(estimator=model,X=x,y=y,cv=SFP)
In [89]:
score
Out[89]:
array([0.75  , 0.5625, 0.75  , 0.875 , 0.625 , 0.8125])
In [90]:
# mean of the Score
score.mean()
Out[90]:
0.7291666666666666
In [ ]:
 

Neural Networks

In [92]:
model=tf.keras.Sequential([
    tf.keras.layers.Dense(17,input_shape=(x_train.shape[1],),activation='relu'),
    tf.keras.layers.Dense(8,activation='relu'), # hidden layer
    tf.keras.layers.Dense(6,activation='relu'),  # hidden layer
    tf.keras.layers.Dense(1,activation='sigmoid')
])

Neural Network Model

  • Implemented using Keras in TensorFlow
  • 4 layers:
    1. 17 neurons, ReLU activation, input shape (batch_size, x_train.shape[1])
    2. 8 neurons, ReLU activation
    3. 6 neurons, ReLU activation
    4. 1 neuron, sigmoid activation (binary classification)
  • Architecture: input -> 17 neurons -> 8 neurons -> 6 neurons -> output
  • Simple feedforward neural network with three hidden layers
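Assuming the network sees the 17 selected feature columns (an assumption; the exact width of x_train is set earlier in the notebook), the trainable parameter count works out as inputs × units + units per Dense layer:

```python
# Parameter count per Dense layer: weights (inputs * units) plus biases (units).
# The 17-feature input width is an assumption based on the selected columns.
layers = [(17, 17), (17, 8), (8, 6), (6, 1)]  # (inputs, units) per layer
params = sum(i * u + u for i, u in layers)
print(params)  # 306 + 144 + 54 + 7 = 511 trainable parameters
```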
In [94]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

Compiling a Neural Network Model

The code compiles a neural network model using Keras and TensorFlow. The model is configured with:

  • Adam optimizer to update weights during training
  • Binary cross-entropy loss function for binary classification problems
  • Accuracy metric to evaluate model performance
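Binary cross-entropy for a single prediction is -(y·log(p) + (1-y)·log(1-p)); a quick worked example with an illustrative probability:

```python
# Worked binary cross-entropy for one sample: true label y=1, predicted p=0.8.
import math

y_true, p = 1, 0.8
loss = -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))
print(round(loss, 4))  # confident correct predictions give low loss
```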
In [96]:
hist = model.fit(x_train,y_train,epochs=20,batch_size=8,verbose=0)
In [97]:
hist_df = pd.DataFrame(hist.history) 
In [98]:
hist_df.head()
Out[98]:
accuracy loss
0 0.490196 25.900238
1 0.490196 6.817887
2 0.607843 3.752803
3 0.568627 3.550312
4 0.529412 1.501967
In [99]:
#ploting of the data history
hist_df.plot()
Out[99]:
<Axes: >
[training accuracy and loss curves over the 20 epochs]
In [ ]:
 
In [100]:
loss,accuracy = model.evaluate(x_test,y_test)
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 220ms/step - accuracy: 0.5385 - loss: 0.7602

Evaluating Neural Network Model Performance

This code line evaluates a trained neural network model using the test dataset.

  • loss is the binary cross-entropy on the test set, measuring how far the predicted probabilities are from the true labels.
  • accuracy measures how often the thresholded predictions match the true labels.

model.evaluate() takes two inputs: x_test (input data) and y_test (true output).

The function returns two values: loss and accuracy, stored in loss and accuracy variables.

Example output: loss: 0.7602, accuracy: 0.5385 (the model's average test loss is 0.7602 and its accuracy is about 54%, barely better than chance on this nearly balanced dataset).
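To turn the sigmoid outputs into 0/1 diagnoses, the usual step (not shown in the notebook) is to threshold model.predict at 0.5; the probabilities below are illustrative:

```python
# Thresholding illustrative sigmoid outputs at 0.5 to get class labels.
import numpy as np

probs = np.array([0.12, 0.48, 0.51, 0.97])  # example model.predict outputs
labels = (probs > 0.5).astype(int)
print(labels)  # class 1 wherever the probability exceeds 0.5
```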

In [102]:
accuracy
Out[102]:
0.5384615659713745
In [ ]:
 
In [ ]:
 
In [ ]: